Abstract

Consider a regression model with infinitely many parameters and time series errors. We are 
interested in choosing weights for averaging across generalized least squares (GLS) estimators 
obtained from a set of approximating models. However, GLS estimators, depending on the 
unknown inverse covariance matrix of the errors, are usually infeasible. We therefore construct 
feasible generalized least squares (FGLS) estimators using a consistent estimator of the unknown 
inverse matrix. Based on this inverse covariance matrix estimator and FGLS estimators, we develop
a feasible autocovariance-corrected Mallows model averaging criterion to select weights,
thereby providing an FGLS model averaging estimator of the true regression function. We show 
that the generalized squared error loss of our averaging estimator is asymptotically equivalent 
to the minimum one among those of GLS model averaging estimators with the weight vectors 
belonging to a continuous set, which includes the discrete weight set used in Hansen (2007) as 
its proper subset. 
1 Introduction

This article is concerned with the implementation of model averaging methods in regression models 
with time series errors. We are interested in choosing weights for averaging across generalized 
least squares (GLS) estimators obtained from a set of approximating models for the true regression 
function. However, GLS estimators, depending on the unknown inverse covariance matrix $\Sigma_n^{-1}$ of the errors,
are usually infeasible, where $n$ is the sample size. We therefore construct feasible generalized least
squares (FGLS) estimators using a consistent estimator of $\Sigma_n^{-1}$. Based on this inverse covariance
matrix estimator and FGLS estimators, we develop a feasible autocovariance-corrected Mallows 
model averaging (FAMMA) criterion to select weights, thereby providing an FGLS model averaging 
estimator of the regression function. We show that the generalized squared error loss of our averaging 
estimator is asymptotically equivalent to the minimum one among those of GLS model averaging 
estimators with the weight vectors belonging to a continuous set, which includes the discrete weight
set used in Hansen (2007) as its proper subset. 

Let M be the number of approximating models. If the weight set only contains standard unit 
vectors in $\mathbb{R}^M$, then selection of weights for model averaging is equivalent to selection of models.
Therefore, model selection can be viewed as a special case of model averaging. It is shown in Hansen 
(2007, p.1179) that when the weight set is rich enough, the optimal model averaging estimator 
usually outperforms the one obtained from the optimal single model, providing ample reason to 
conduct model averaging. Another vivid example demonstrating the advantage of model averaging 
over model selection is given by Yang (2007, Section 6.2.1, Figure 5). In the case of independent 
errors, asymptotic efficiency results for model selection have been reported extensively, even when 
the errors are heteroskedastic or regression functions are serially correlated. For the regression 
model with i.i.d. Gaussian errors, Shibata (1981) showed that Mallows’ $C_p$ (Mallows (1973)) and
the Akaike information criterion (AIC; Akaike (1974)) lead to asymptotically efficient estimators of
the regression function. By making use of Whittle’s (1960) moment bounds for quadratic forms
in independent variables, Li (1987) established the asymptotic efficiency of Mallows’ $C_p$ under
much weaker assumptions on homogeneous errors. Li’s (1987) result was subsequently extended by
Andrews (1991) to heteroscedastic errors. There are also asymptotic efficiency results established in
situations where regression functions are serially correlated. Assuming that the data are generated
from an infinite order autoregressive (AR($\infty$)) process driven by i.i.d. Gaussian noise, Shibata (1980)
showed that AIC is asymptotically efficient for independent-realization prediction. This result was
extended to non-Gaussian AR($\infty$) processes by Lee and Karagrigoriou (2001). Ing and Wei (2005)
showed that AIC is also asymptotically efficient for same-realization prediction. Ing (2007) further 
pointed out that the same property holds for a modification of Rissanen’s accumulated prediction 
error (APE, Rissanen (1986)) criterion. 

Asymptotic efficiency results for model averaging have also attracted much recent attention from 
econometricians and statisticians. Hansen (2007) proposed the Mallows model averaging (MMA) 
criterion, which selects weights for averaging across LS estimators. Under regression models with 
i.i.d. explanatory vectors and errors, he proved that the averaging estimator obtained from the 
MMA criterion asymptotically attains the minimum squared error loss among those of the LS 
model averaging estimators with the weight vectors contained in a discrete set $\mathcal{H}_n(N)$ (see (2.8)),
in which $N$ is a positive integer related to the moment restrictions of the errors. Using the
same weight set, Hansen and Racine (2012) and Liu and Okui (2013), respectively, showed that
the Jackknife model averaging (JMA) criterion and the feasible HR$C_p$ criterion yield asymptotically
efficient LS model averaging estimators in regression models with independent explanatory vectors
and heteroscedastic errors. Since $\mathcal{H}_n(N)$ is quite restrictive when $N$ is small, Wan, Zhang and Zou
(2010) justified MMA’s asymptotic efficiency over the continuous weight set

$$\mathcal{G}_n = \Big\{w = (w_1,\ldots,w_M) : w_m \in [0,1],\ \sum_{m=1}^{M} w_m = 1\Big\}, \qquad (1.1)$$

which is much more flexible than $\mathcal{H}_n(N)$. Recently, Ando and Li (2014) showed that Hansen and
Racine’s (2012) result carries over to high-dimensional regression models and to a weight set more
general than $\mathcal{G}_n$.

There are different types of theoretical examinations on model averaging. Besides the approach 
of targeting asymptotic efficiency, another very successful approach is minimax optimal model 
combination via oracle inequalities; see, for example, Yang (2001), Yuan and Yang (2005), Leung 
and Barron (2006), and Wang et al. (2014).

However, all the aforementioned papers, requiring the error terms to be independent, preclude the
regression model with time series errors, which is one of the most useful models for analyzing dependent
data. In this article, we take the first step to close this gap by introducing the FAMMA
criterion and proving its asymptotic efficiency in the sense mentioned in the first paragraph. However,
minimax optimality results are not pursued here. Our criterion has some distinctive features.
First, it involves estimation of the high-dimensional inverse covariance matrix of a stationary time
series that is not directly observable. Note that the covariance matrix of a stationary time series of
length $n$ can be viewed as a high-dimensional covariance matrix because its dimension equals
the sample size. In situations where the error process is observable (or equivalently, the regression
functions are known to be zero), Wu and Pourahmadi (2009) proposed a banded covariance matrix
estimator of $\Sigma_n$ and proved its consistency under spectral norm, which also leads to the consistency
of the corresponding inverse matrix in estimating $\Sigma_n^{-1}$. These results were then extended by McMurry
and Politis (2010) to tapered covariance matrix estimators. However, since the error process
is in general unobservable, one can only estimate $\Sigma_n^{-1}$ (or $\Sigma_n$) through the output variables. As
far as estimating $\Sigma_n^{-1}$ is concerned, these output variables are contaminated by unknown regression
functions. In Section 3, we propose estimating $\Sigma_n^{-1}$ by its banded Cholesky decomposition with
the corresponding parameters estimated nonparametrically from the least squares residuals of an increasing
dimensional approximating model. We also obtain the rate of convergence of the proposed
estimator, which plays a crucial role in proving the asymptotic efficiency of the FAMMA criterion.
Second, our criterion is justified under a continuous weight set $\mathcal{H}_N$ (see (2.10)). While $\mathcal{H}_N$ is not
as general as $\mathcal{G}_n$, as argued in Section 2, it can substantially reduce the limitations encountered by
$\mathcal{H}_n(N)$ when $N$ is small.

It is worth mentioning that to justify MMA’s asymptotic efficiency over the weight set $\mathcal{G}_n$,
Wan, Zhang and Zou (2010) required a stringent condition on M; see (2.20) of Section 2. As 
argued in Remark 4, this condition may preclude the approximating models whose estimators have 
the minimum risk (ignoring constants). When these models/estimators are precluded, the MMA 
criterion can only select weights for a set of suboptimal models/estimators, which is obviously 
not desirable. In fact, the same dilemma also arises in Ando and Li (2014), who used a similar 
assumption to prove their asymptotic efficiency results. Zhang, Wan and Zou (2013) considered 
model averaging problems in regression models with dependent errors. They adopted the JMA 
criterion to choose weights for a class of estimators and showed that the criterion is asymptotically 
efficient over the weight set $\mathcal{G}_n$. Their result, however, is still reliant on a condition similar to (2.20).
In addition, the class of estimators considered in their paper, excluding all FGLS estimators, may 
suffer from lack of efficiency. 

The remainder of the paper is organized as follows. In Section 2, we first concentrate on the case
where $\Sigma_n$ is known. We show in Theorem 1 that the autocovariance-corrected Mallows model
averaging (AMMA) criterion, which is the FAMMA criterion with the estimator of $\Sigma_n^{-1}$ replaced
by $\Sigma_n^{-1}$ itself, is asymptotically efficient. Since the assumptions used in Theorem 1 are rather mild,
both Corollary 2.1 of Li (1987) and Theorem 1 of Hansen (2007) become special cases of it. We
then turn attention to the more practical situation where $\Sigma_n$ is unknown and propose choosing
model weights by the FAMMA criterion. It is shown in Theorem 2 of Section 2 that the FAMMA
criterion is asymptotically efficient as long as the corresponding estimator of $\Sigma_n^{-1}$ has a sufficiently
fast convergence rate. In Section 3, we provide a consistent estimator of $\Sigma_n^{-1}$ based on its banded
Cholesky decomposition, and derive the estimator’s convergence rate under various situations. In
Section 4, the asymptotic efficiency of the FAMMA criterion with $\Sigma_n^{-1}$ estimated by the method
proposed in Section 3 is established. Finally, we conclude in Section 5. All proofs are relegated to
the Appendix in order to maintain the flow of exposition.

2 The AMMA and FAMMA criteria 

Consider a regression model with infinitely many parameters,

$$y_t = \sum_{j=1}^{\infty} \theta_j x_{tj} + e_t = \mu_t + e_t, \quad t = 1,\ldots,n, \qquad (2.1)$$

where $\mu_t = \sum_{j=1}^{\infty} \theta_j x_{tj}$, $x_t = (x_{t1}, x_{t2}, \ldots)'$ is the explanatory vector with $\sup_{t \ge 1, j \ge 1} E(x_{tj}^2) < \infty$,
$\theta_j$, $j \ge 1$, are unknown parameters satisfying $\sum_{j=1}^{\infty} |\theta_j| < \infty$, and $\{e_t\}$, independent of $\{x_t\}$, is
an unobservable stationary process with zero mean and finite variance. In matrix notation, $Y_n =
\mu_n + e_n$, where $Y_n = (y_1,\ldots,y_n)'$, $\mu_n = (\mu_1,\ldots,\mu_n)'$ and $e_n = (e_1,\ldots,e_n)'$. The central focus of
this paper is to explore how and to what extent model averaging can be implemented in the
presence of time series errors.

Let $m = 1,\ldots,M$ index a set of approximating models of (2.1), where the $m$th model uses the
first $k_m$ elements of $\{x_t\}$ with $1 \le k_1 < k_2 < \cdots < k_M < n$ and $M$ is allowed to grow to infinity
with the sample size $n$. Assume that $\Sigma_n = E(e_n e_n')$ is known and $\Sigma_n^{-1}$ exists. Then the generalized
least squares (GLS) estimator of the regression coefficient vector in the $m$th approximating model
is given by $\hat{\theta}_m = (X_m'\Sigma_n^{-1}X_m)^{-1}X_m'\Sigma_n^{-1}Y_n$, and the resultant estimate of $\mu_n$ is $\hat{\mu}_n(m) =
X_m\hat{\theta}_m = P_m^* Y_n$, where $X_m = (x_{ij})_{1 \le i \le n,\, 1 \le j \le k_m}$, $P_m^* = X_m(X_m'\Sigma_n^{-1}X_m)^{-1}X_m'\Sigma_n^{-1}$, and $X_M$ is assumed to be almost
surely (a.s.) full rank throughout the paper. The model averaging estimator of $\mu_n$ based on the
$M$ approximating models is $\hat{\mu}_n(w) = P^*(w)Y_n$, where $w \in \mathcal{G}_n$ and $P^*(w) = \sum_{m=1}^{M} w_m P_m^*$. To
evaluate the performance of $\hat{\mu}_n(w)$, we use the generalized squared error (GSE) loss

$$L_n^*(w) = (\hat{\mu}_n(w) - \mu_n)'\Sigma_n^{-1}(\hat{\mu}_n(w) - \mu_n).$$

This loss function is a natural generalization of Hansen’s (2007) average squared error in the sense
that $L_n^*(w)$ reduces to the latter when $\Sigma_n^{-1}$ is replaced by the $n \times n$ identity matrix. Through
this generalization, it is easy to establish a connection between our results and some classical
asymptotic efficiency results on model averaging/selection, thereby leading to a more comprehensive
understanding of this research field. For further discussion, see Remarks 1 and 2 below. On the
other hand, when the future values of $y_t$ are entertained instead of the regression function $\mu_n$, Wei
and Yang (2012) proposed several different loss functions from a prediction point of view. Compared
with the squared errors, their loss functions are particularly suitable for dealing with outliers.
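As an informal illustration of the definitions above, the following NumPy sketch forms the GLS projection matrices $P_m^*$ for nested models, averages the fits with a weight vector, and evaluates the GSE loss. The AR(1) covariance, the regressors, the coefficients and all function names are hypothetical choices made only for this example; they are not taken from the paper.

```python
import numpy as np

def gls_projection(X_m, Sigma_inv):
    """P*_m = X_m (X_m' S^{-1} X_m)^{-1} X_m' S^{-1}, with S^{-1} = Sigma_inv."""
    A = X_m.T @ Sigma_inv
    return X_m @ np.linalg.solve(A @ X_m, A)

def gls_average(Y, X, ks, w, Sigma_inv):
    """Weighted average of nested GLS fits; model m uses the first ks[m] columns of X."""
    P_w = sum(w_m * gls_projection(X[:, :k], Sigma_inv) for w_m, k in zip(w, ks))
    return P_w @ Y

def gse_loss(mu_hat, mu, Sigma_inv):
    """Generalized squared error loss (mu_hat - mu)' S^{-1} (mu_hat - mu)."""
    d = mu_hat - mu
    return float(d @ Sigma_inv @ d)

# Toy setting (assumed for illustration): AR(1) errors with known covariance.
rng = np.random.default_rng(0)
n, rho = 60, 0.5
lags = np.abs(np.subtract.outer(np.arange(n), np.arange(n)))
Sigma = rho ** lags / (1 - rho**2)
Sigma_inv = np.linalg.inv(Sigma)
X = rng.standard_normal((n, 6))
mu = X @ (1.0 / np.arange(1, 7) ** 2)
Y = mu + np.linalg.cholesky(Sigma) @ rng.standard_normal(n)

ks = [1, 3, 6]                      # nested model sizes k_1 < k_2 < k_3
w = np.array([0.2, 0.3, 0.5])       # a weight vector in the unit simplex
mu_hat = gls_average(Y, X, ks, w, Sigma_inv)
loss = gse_loss(mu_hat, mu, Sigma_inv)
```

Passing the identity matrix as `Sigma_inv` to `gse_loss` gives the ordinary squared error, which mirrors the reduction to Hansen’s (2007) loss noted in the text.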

The next lemma provides a representation for the conditional risk,

$$R_n^*(w) = E(L_n^*(w) \mid x_1,\ldots,x_n) = E_x(L_n^*(w)),$$

which is an extension of Lemma 2 of Hansen (2007) to the case of dependent errors.

Lemma 1. Assume (2.1) and that $\Sigma_n^{-1}$ exists. Then, for any $w \in \mathcal{G}_n$,

$$R_n^*(w) = \sum_{m=1}^{M}\sum_{l=1}^{M} w_m w_l\big[\mu_n'\Sigma_n^{-1/2}(I - P_{\max\{m,l\}})\Sigma_n^{-1/2}\mu_n + \min\{k_m, k_l\}\big], \qquad (2.2)$$

where $P_j = \Sigma_n^{-1/2}X_j(X_j'\Sigma_n^{-1}X_j)^{-1}X_j'\Sigma_n^{-1/2}$ is the orthogonal projection matrix for the column space
of $\Sigma_n^{-1/2}X_j$.
To choose a data-driven weight vector asymptotically minimizing $L_n^*(w)$ over a suitable weight
set $\mathcal{H} \subseteq \mathcal{G}_n$, we propose the AMMA criterion,

$$C_n^*(w) = (Y_n - \hat{\mu}_n(w))'\Sigma_n^{-1}(Y_n - \hat{\mu}_n(w)) + 2\sum_{m=1}^{M} w_m k_m. \qquad (2.3)$$

Note that the AMMA criterion is a special case of the criterion given in (6.2*) of Andrews (1991)
with $M_n(h) = P^*(w)$ and $W = \Sigma_n^{-1}$. It reduces to the MMA criterion when

$\{e_t\}$ is a sequence of i.i.d. random variables with $E(e_1) = 0$, $0 < E(e_1^2) = \sigma^2 < \infty$. (2.4)

Recently, Liu, Okui and Yoshimura (2013) also suggested using $C_n^*(w)$ to choose weight vectors in
situations where $e_t$ are independent but possibly heteroscedastic. While the AMMA criterion is not
new to the literature, the question of whether its minimizer can (asymptotically) minimize $L_n^*(w)$
seems rarely discussed, in particular when $\mathcal{H}$ is uncountable.
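To make the relation between (2.3) and the MMA criterion concrete, here is a small sketch that evaluates both criteria on simulated data with $\Sigma_n = \sigma^2 I_n$. The data, the helper names `amma_criterion` and `mallows_criterion`, and the particular weights are assumptions made for illustration. Under this choice of $\Sigma_n$ the two criteria should agree up to the constant factor $\sigma^2$, which does not affect the minimizing weights.

```python
import numpy as np

def amma_criterion(Y, fits, ks, w, Sigma_inv):
    """C*_n(w) = (Y - mu_hat(w))' S^{-1} (Y - mu_hat(w)) + 2 sum_m w_m k_m.
    fits is an (M, n) array whose m-th row is the fit of model m."""
    resid = Y - w @ fits
    return float(resid @ Sigma_inv @ resid + 2.0 * np.dot(w, ks))

def mallows_criterion(Y, fits, ks, w, sigma2):
    """MMA criterion of Hansen (2007): ||Y - mu_hat(w)||^2 + 2 sigma^2 sum_m w_m k_m."""
    resid = Y - w @ fits
    return float(resid @ resid + 2.0 * sigma2 * np.dot(w, ks))

# Hypothetical i.i.d.-error example, i.e. Sigma_n = sigma^2 * I_n.
rng = np.random.default_rng(1)
n, sigma2 = 50, 2.0
X = rng.standard_normal((n, 4))
Y = X @ np.array([1.0, 0.5, 0.25, 0.125]) + np.sqrt(sigma2) * rng.standard_normal(n)
ks = np.array([1, 2, 4])

# With Sigma_n proportional to the identity, GLS fits coincide with OLS fits.
fits = np.array([X[:, :k] @ np.linalg.lstsq(X[:, :k], Y, rcond=None)[0] for k in ks])
w = np.array([0.1, 0.4, 0.5])

c_amma = amma_criterion(Y, fits, ks, w, np.eye(n) / sigma2)
c_mma = mallows_criterion(Y, fits, ks, w, sigma2)
```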

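Recall that by assuming (2.4),

$$\xi_n = \inf_{w \in \mathcal{G}_n} R_n^*(w) \to \infty \quad \text{a.s.}, \qquad (2.5)$$

and

$$E(|e_1|^{4(N+1)} \mid x_1) \le \kappa < \infty \quad \text{a.s., for some positive integer } N, \qquad (2.6)$$

Hansen (2007, Theorem 1) showed that $C_n^*(w)$ is asymptotically efficient in the sense that

$$\frac{L_n^*(\hat{w}_n)}{\inf_{w \in \mathcal{H}_n(N)} L_n^*(w)} \to_p 1, \qquad (2.7)$$

where $\to_p$ denotes convergence in probability,

$$\mathcal{H}_n(N) = \Big\{w : w_m \in \{0, 1/N, 2/N, \ldots, 1\},\ \sum_{m=1}^{M} w_m = 1\Big\}, \qquad (2.8)$$

and

$$\hat{w}_n = \arg\min_{w \in \mathcal{H}_n(N)} C_n^*(w).$$
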
Equation (2.7) gives a positive answer to the above question in the special case where $\Sigma_n = \sigma^2 I_n$
and $\mathcal{H} = \mathcal{H}_n(N)$ is a discrete set. When (2.6) holds for sufficiently large $N$, the restriction of $\mathcal{G}_n$ to
$\mathcal{H}_n(N)$ is not an issue of overriding concern because the grid points $i/N$, $i = 0,\ldots,N$, in $\mathcal{H}_n(N)$ are
dense enough to provide a good approximation for the optimal weight vector among

$$\bar{\mathcal{H}}_n(N) = \Big\{w : w_m \in [0,1],\ 1 \le \sum_{m=1}^{M} 1_{\{w_m \ne 0\}} \le N,\ \sum_{m=1}^{M} w_m = 1\Big\}, \qquad (2.9)$$

and hence among $\mathcal{G}_n$. We call $\bar{\mathcal{H}}_n(N)$ the continuous extension of $\mathcal{H}_n(N)$ because it satisfies $\mathcal{H}_n(N) \subset
\bar{\mathcal{H}}_n(N)$ and $a w_1 + b w_2 \in \bar{\mathcal{H}}_n(N)$, for any $0 \le a, b \le 1$ with $a + b = 1$ and any $w_1 = (w_{11},\ldots,w_{M1})'$
and $w_2 = (w_{12},\ldots,w_{M2})' \in \mathcal{H}_n(N)$ with $\sum_{m=1}^{M} |1_{\{w_{m1} \ne 0\}} - 1_{\{w_{m2} \ne 0\}}| = 0$. It is shown in Hansen
(2007, p.1179) that even when $N = 2$, the optimal weight vector in $\mathcal{H}_n(N)$ can yield an averaging
estimator outperforming the one based on the optimal single model, except in some special cases.

On the other hand, when (2.6) holds only for moderate or small values of $N$, not only the
optimal weight vector in $\mathcal{G}_n$ but also that in $\bar{\mathcal{H}}_n(N)$ cannot be well approximated by the elements in
$\mathcal{H}_n(N)$. As a result, the advantage of model averaging over model selection becomes less apparent.
To rectify this deficiency, we introduce the following continuous extension of $\mathcal{H}_n(N)$,

$$\mathcal{H}_N = \bigcup_{l=1}^{N} \mathcal{H}(l), \qquad (2.10)$$

where

$$\mathcal{H}(l) = \Big\{w : \delta \le w_m \le 1 \text{ for every } m \text{ with } w_m \ne 0,\ \sum_{m=1}^{M} 1_{\{w_m \ne 0\}} = l,\ \sum_{m=1}^{M} w_m = 1\Big\},$$

with $0 < \delta < 1/N$. The number of non-zero components is $l$, $1 \le l \le N$, for any vector in $\mathcal{H}(l)$.
Therefore, this weight set leads to sparse combinations of $\hat{\mu}_n(m)$, $m = 1,\ldots,M$. For a detailed
discussion on sparse combinations from the minimax viewpoint, see Wang et al. (2014). We will
show in Theorem 1 that

$$\frac{L_n^*(\hat{w}_n)}{\inf_{w \in \mathcal{H}_N} L_n^*(w)} \to_p 1 \qquad (2.11)$$

without the restriction $\Sigma_n = \sigma^2 I_n$, where

$$\hat{w}_n = \arg\inf_{w \in \mathcal{H}_N} C_n^*(w).$$
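The two weight sets can be compared with a short sketch. It enumerates the discrete grid in (2.8) for small $M$ and $N$ and tests membership in the sparse continuous set of (2.10), read here as: weights sum to one, between 1 and $N$ entries are non-zero, and every non-zero entry is at least $\delta$. The function names, the tolerance and the numerical values are hypothetical and chosen only for illustration.

```python
import itertools
import numpy as np

def discrete_weights(M, N):
    """All w with entries in {0, 1/N, ..., 1} summing to one (the grid in (2.8))."""
    out = []
    for counts in itertools.product(range(N + 1), repeat=M):
        if sum(counts) == N:
            out.append(np.array(counts) / N)
    return out

def in_sparse_set(w, N, delta, tol=1e-12):
    """Membership in the union over l = 1..N of H(l), as read from (2.10)."""
    w = np.asarray(w, dtype=float)
    nz = w[w != 0]
    return bool(
        abs(w.sum() - 1.0) <= tol
        and 1 <= nz.size <= N
        and np.all(nz >= delta - tol)
        and np.all(nz <= 1.0 + tol)
    )

M, N, delta = 4, 2, 0.1            # delta must satisfy 0 < delta < 1/N
grid = discrete_weights(M, N)
grid_inside = all(in_sparse_set(w, N, delta) for w in grid)
```

For instance, `in_sparse_set([0.37, 0.63, 0, 0], N, delta)` is true although the vector is not on the grid, while a vector with three non-zero entries, or with a non-zero entry below `delta`, is rejected.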

It is important to be aware that $\mathcal{H}_N$ can inherit the benefits of $\bar{\mathcal{H}}_n(N)$ mentioned previously
because the difference between the two sets can be made arbitrarily small by making $\delta$ sufficiently
close to 0. Technically speaking, a nonzero (regardless of how small) $\delta$ enables us to establish some
sharp uniform probability bounds through replacing $R_n^*(w)$ by suitable model selection risks (see
(A.7)), thereby overcoming the difficulties arising from the uncountability of $\mathcal{H}_N$. In Remarks 3 and
4 after Theorem 1, we will also discuss the asymptotic efficiency of $C_n^*(w)$ over more general weight
sets such as $\mathcal{G}_n$ and its variants. The following assumptions on $\{e_t\}$ are needed in our analysis.
As shown in the Appendix, these assumptions allow us to derive sharp bounds for the moments of
quadratic forms in $\{e_t\}$ using the first moment bound theorem of Findley and Wei (1993).

Assumption 1. $\{e_t\}$ is a sequence of stationary time series with autocovariance function (ACF)
$\gamma_j = E(e_t e_{t+j})$ satisfying $\sum_{j=-\infty}^{\infty} \gamma_j^2 < \infty$, and admits a linear representation

$$e_t = \alpha_t + \sum_{k=1}^{\infty} \beta_k \alpha_{t-k} \qquad (2.12)$$

in terms of the $\mathcal{F}_t$-measurable random variables $\alpha_t$, where $\mathcal{F}_t$, $-\infty < t < \infty$, is an increasing sequence
of $\sigma$-fields of events. Moreover, $\{\alpha_t\}$ satisfies the following properties with probability 1:

(M1) $E(\alpha_t \mid \mathcal{F}_{t-1}) = 0$.

(M2) $E(\alpha_t^2 \mid \mathcal{F}_{t-1}) = \sigma_\alpha^2$.

(M3) There exist a positive integer $N$ and a positive number $S > 4N$ such that for some constant
$0 < C_S < \infty$,

$$\sup_{-\infty < t < \infty} E(|\alpha_t|^S \mid \mathcal{F}_{t-1}) \le C_S. \qquad (2.13)$$


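Assumption 2. The spectral density function of $\{e_t\}$,

$$f_e(\lambda) = \frac{\sigma_\alpha^2}{2\pi}\Big|\sum_{j=0}^{\infty} \beta_j e^{-ij\lambda}\Big|^2 \ne 0 \qquad (2.14)$$

for all $-\pi < \lambda \le \pi$, where $\beta_0 = 1$. Moreover,

$$\sum_{j=0}^{\infty} |\beta_j| < \infty. \qquad (2.15)$$
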
We are now in a position to state Theorem 1. 

Theorem 1. Assume Assumptions 1 and 2 in which $N$ in (M3) is fixed. Let

$$D_n(m) = \mu_n'\Sigma_n^{-1/2}(I - P_m)\Sigma_n^{-1/2}\mu_n + k_m \quad \text{and} \quad k_n^* = \min_{1 \le m \le M} D_n(m).$$

Suppose

$$k_n^* \to \infty \quad \text{a.s.} \qquad (2.16)$$

Then, (2.11) follows.

A few comments on Theorem 1 are in order. 


Remark 1. Theorem 1 generalizes Theorem 1 of Hansen (2007) in several directions. First,
(2.4) is a special case of (2.12), with $\beta_k = 0$ for all $k \ge 1$. Second, the discrete weight set $\mathcal{H}_n(N)$ is
extended to its continuous extension $\mathcal{H}_N$. Third, when (2.4) holds and $\{e_t\}$ is independent of $\{x_t\}$,
the moment condition (2.13) is milder than (2.6). Fourth, (2.5) is weakened to (2.16), which is much
easier to verify. Note that $k_n^*$ can be viewed as an index of the amount of information contained in
the candidate models. Therefore, (2.16) is quite natural from the estimation theoretical viewpoint;
see, e.g., Lai and Wei (1982), Yu, Lin and Cheng (2012) and Chan, Huang and Ing (2013). Suppose
there exists a non-random and non-negative function $Q(m)$ satisfying

$$\sup_{1 \le m \le M}\left|\frac{\mu_n'\Sigma_n^{-1/2}(I - P_m)\Sigma_n^{-1/2}\mu_n}{n} - Q(m)\right| \to 0, \quad \text{a.s.}$$

Then, (2.16) is fulfilled if $Q(m) \ne 0$ for all $m$ and $\lim_{m \to \infty} Q(m) = 0$, which essentially require
that all candidate models are misspecified, but those which have many parameters can give good
approximations of the true model.
Remark 2. Corollary 2.1 of Li (1987) also becomes a special case of Theorem 1. To see this,
note that under (2.4), (2.16) and 

E(ef) < oo, (2-17) 

Li’s (1987) Corollary 2.1 shows that (2.11) holds with N = 1, namely, C*(w) is asymptotically 
efficient for model selection. However, since (2.4) and (2.17) also imply Assumptions 1 and 2, Li’s 
conclusion readily follows from Theorem 1. 


Remark 3. It is far from being trivial to extend (2.11) to

$$\frac{L_n^*(w^0)}{\inf_{w \in \mathcal{G}_n} L_n^*(w)} \to_p 1, \qquad (2.18)$$

where

$$w^0 = \arg\inf_{w \in \mathcal{G}_n} C_n^*(w).$$

Alternatively, if $\alpha_t$ have light-tailed distributions, such as those described in (C2) and (C3) of Ing
and Lai (2011), then by using the exponential probability inequalities developed in the same paper
in place of the moment inequalities given in the proof of Theorem 1, it can be shown that (2.11)
holds with $\delta$ tending to 0 and $N = M$ tending to $\infty$ sufficiently slowly with $n$. The details, however,
are not reported here due to space constraints.

Remark 4. When (2.4) holds true, (2.18) has been developed in Theorem 1’ of Wan, Zhang
and Zou (2010) under

$$E(|e_t|^{4G} \mid x_t) \le \kappa < \infty \quad \text{a.s.}, \qquad (2.19)$$

for some integer $1 \le G < \infty$, and

$$M\xi_n^{-2G}\sum_{m=1}^{M} D_n^G(m) \to 0 \quad \text{a.s.} \qquad (2.20)$$

Unfortunately, (2.20), imposing a stringent restriction on $M$, often precludes models having small
GSE losses. To see this, assume that $x_t$ are nonrandom and the $m$th approximating model contains
the first $m$ regressors, namely $k_m = m$. Assume also that

$$D_n(m) = nm^{-\alpha} + m \qquad (2.21)$$

and the $G$ in (2.19) is greater than $1/\alpha$ for some $\alpha > 1$. It is easy to show that $D_n(m)$ is minimized
by $m \sim (\alpha n)^{1/(1+\alpha)}$, yielding the optimal rate of $D_n(m)$, $n^{1/(1+\alpha)}$. Moreover, as will be clear from
(A.5), $D_n(m) = R_n^*(v_m)$ is asymptotically equivalent to $L_n^*(v_m)$, where $v_m$ is the $m$th standard
unit vector in $\mathbb{R}^M$. Hence the optimal rate of $L_n^*(v_m)$ is also $n^{1/(1+\alpha)}$, which is achievable by any
approximating model whose number of regressors $m$ satisfies

$$c_1 n^{1/(1+\alpha)} \le m \le c_2 n^{1/(1+\alpha)}, \quad \text{for some } 0 < c_1 \le c_2 < \infty. \qquad (2.22)$$

If $M \sim c_0 n^{1/(1+\alpha)}$, where $c_0$ is any positive number, then for $G > 1/\alpha$ with $\alpha > 1$, there exists
$c_3 > 0$ such that $M\xi_n^{-2G}\sum_{m=1}^{M} D_n^G(m) \ge c_3 M n^{G}/n^{2G/(1+\alpha)} \to \infty$ as $n \to \infty$, which violates (2.20).



In fact, it is shown in Example 2 of Wan, Zhang and Zou (2010) that a sufficient condition for (2.20)
to hold is $M = O(n^v)$ with $v < G/(1 + 2\alpha G) < 1/(1 + \alpha)$. These facts reveal that all models with
$L_n^*(v_m)$ achieving the optimal rate $n^{1/(1+\alpha)}$ (or equivalently, with $m$ obeying (2.22)) are excluded
by (2.20). Under such a situation, $C_n^*(w)$ can only select weights for a set of suboptimal models.
Therefore, it is hard to conclude from their Theorem 1’ that $\hat{\mu}_n(w^0)$’s GSE loss is asymptotically
smaller than that of $\hat{\mu}_n(m)$ with $m$ satisfying (2.22), even though this theorem guarantees $C_n^*(w)$’s
asymptotic efficiency in the sense of (2.18). On the other hand, since Theorem 1 does not impose any
restrictions similar to (2.20), one is free to choose $M \sim Cn^{1/(1+\alpha)}$, with $C$ sufficiently large, so as to
include the optimal model $m \sim (\alpha n)^{1/(1+\alpha)}$. In addition, by noticing $k_n^* = \min_{1 \le m \le M} D_n(m) \to \infty$
as $n \to \infty$, we know from Theorem 1 that $\hat{w}_n$ satisfies (2.11), and hence $\hat{\mu}_n(\hat{w}_n)$ asymptotically
outperforms the best one among $\hat{\mu}_n(m)$, $1 \le m \le Cn^{1/(1+\alpha)}$, in terms of GSE loss. When $\alpha$ is
unknown, the bound $M \sim Cn^{1/(1+\alpha)}$ is infeasible. However, if a strict lower bound for $\alpha$, say $\underline{\alpha}$, is
known a priori, then the same conclusion still holds for $M \sim C^* n^{1/(1+\underline{\alpha})}$ with any $C^* > 0$.
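A quick numerical check of the rate calculation in this remark may help. The sketch below minimizes $D_n(m) = nm^{-\alpha} + m$ over integers and compares the minimizer with $(\alpha n)^{1/(1+\alpha)}$; it also compares the minimizer with the cap $n^{G/(1+2\alpha G)}$ on the number of models. The values of `n`, `alpha` and `G` are arbitrary illustrative choices, not values from the paper.

```python
import numpy as np

def D(m, n, alpha):
    """D_n(m) = n * m^(-alpha) + m, as in (2.21)."""
    return n * m ** (-alpha) + m

n, alpha, G = 10_000, 2.0, 1        # illustrative values with G > 1/alpha
m_grid = np.arange(1, n, dtype=float)

m_best = int(m_grid[np.argmin(D(m_grid, n, alpha))])   # integer minimizer
m_theory = (alpha * n) ** (1.0 / (1.0 + alpha))        # continuous minimizer

# Value of D at the continuous minimizer, in closed form:
# n^{1/(1+a)} * (a^{-a/(1+a)} + a^{1/(1+a)}).
rate = n ** (1.0 / (1.0 + alpha))
d_min_closed_form = rate * (alpha ** (-alpha / (1 + alpha)) + alpha ** (1 / (1 + alpha)))

# Growth cap on M implied by v < G / (1 + 2*alpha*G).
M_cap = n ** (G / (1.0 + 2.0 * alpha * G))
```

With these values the integer minimizer is 27, close to $(\alpha n)^{1/(1+\alpha)} \approx 27.1$, while `M_cap` is roughly 6.3, so a candidate set restricted to that size would not contain the best model.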

Remark 5. It is worth noting that estimating the weight that minimizes $L_n^*(w)$ will generally
introduce a variance inflation factor, which may prevent us from obtaining the asymptotic efficiency.
Under independent errors, a recent paper by Wang et al. (2014) gives a comprehensive
discussion of this matter from the minimax viewpoint. In fact, pursuing the minimax optimal rate
is more relevant than the asymptotic efficiency in the presence of a large variance inflation factor.
On the other hand, one can still attain the asymptotic efficiency by substantially suppressing this
factor through: (i) reducing the size of the weight set and (ii) reducing the number of the candidate
variables, which have been taken by Hansen (2007) and Wan et al. (2010), respectively.
Unfortunately, the limitations imposed on the size of the weight set or M by these authors are too 
stringent, and hence may lead to suboptimal results, as discussed previously. Theorem 1 takes the 
first approach and provides a somewhat striking result that asymptotically efficient model averaging 
is still achievable under a continuous/uncountable weight set, which is in sharp contrast to Hansen’s 
(2007) discrete/countable weight set. The theoretical underpinnings of Theorem 1 are some sharp 
uniform probability bounds, which are presented in the Appendix and established based on a mild 
lower bound condition on the weight set described in (2.10). 

In the case where $\Sigma_n$ is unknown, the asymptotic efficiency of $C_n^*(w)$ developed in Theorem 1
becomes practically irrelevant. However, if there exists a consistent estimate, $\hat{\Sigma}_n^{-1}$, of $\Sigma_n^{-1}$, then
the corresponding FGLS estimator of $\mu_n$ based on the $m$th approximating model is $\hat{P}_m^* Y_n$, where
$\hat{P}_m^* = X_m(X_m'\hat{\Sigma}_n^{-1}X_m)^{-1}X_m'\hat{\Sigma}_n^{-1}$. Moreover, the FAMMA criterion,

$$\hat{C}_n^*(w) = (Y_n - \hat{\mu}_n^*(w))'\hat{\Sigma}_n^{-1}(Y_n - \hat{\mu}_n^*(w)) + 2\sum_{m=1}^{M} w_m k_m, \qquad (2.23)$$

can be used in place of $C_n^*(w)$ to perform model averaging, where

$$\hat{\mu}_n^*(w) = \hat{P}^*(w)Y_n = \sum_{m=1}^{M} w_m \hat{P}_m^* Y_n$$

is the FGLS model averaging estimator of $\mu_n$ given $w$. Define

$$\hat{L}_n^*(w) = (\hat{\mu}_n^*(w) - \mu_n)'\Sigma_n^{-1}(\hat{\mu}_n^*(w) - \mu_n),$$

and

$$\hat{w}_n^* = \arg\inf_{w \in \mathcal{H}_N} \hat{C}_n^*(w).$$

In the next theorem, we shall show that as long as $\hat{\Sigma}_n^{-1}$ converges to $\Sigma_n^{-1}$ sufficiently fast in
terms of spectral norm, (2.11) still holds with $L_n^*(\hat{w}_n)$ replaced by $\hat{L}_n^*(\hat{w}_n^*)$.


Theorem 2. Assume that Assumptions 1 and 2 hold and there exists a sequence of positive numbers
$\{b_n\}$ satisfying $b_n = o(n^{1/2})$ such that

n||E- 1 -S- 1 || 2 = O p (^), (2.24) 

where for the $p \times p$ matrix $A$, $\|A\|^2 = \sup_{z \in \mathbb{R}^p, \|z\| = 1} z'A'Az$ with $\|z\|$ denoting the Euclidean norm
of $z$. Moreover, suppose that there exists $2N/S < \theta < 1/2$ such that

= 0 a.s . (2.25) 

n—too b* 
n 'n 

Then,

$$\frac{\hat{L}_n^*(\hat{w}_n^*)}{\inf_{w \in \mathcal{H}_N} L_n^*(w)} \to_p 1. \qquad (2.26)$$
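The matrix norm used in Theorem 2 is the spectral norm, whose square is the largest eigenvalue of $A'A$. A minimal numerical sketch follows, using a random matrix as a stand-in for the estimation error of the inverse covariance matrix; the matrix, the seed and the function name are hypothetical.

```python
import numpy as np

def spectral_norm_sq(A):
    """||A||^2 = sup over unit vectors z of z'A'Az = largest eigenvalue of A'A."""
    return float(np.linalg.eigvalsh(A.T @ A)[-1])   # eigvalsh returns ascending order

rng = np.random.default_rng(2)
A = rng.standard_normal((5, 5))      # stand-in for an estimation-error matrix

# Evaluate the quadratic form z'A'Az on many random unit vectors z.
z = rng.standard_normal((5, 1000))
z /= np.linalg.norm(z, axis=0)
Az = A @ z
quad = np.einsum("ij,ij->j", Az, Az)
```

No unit vector can give a quadratic form above `spectral_norm_sq(A)`, and the value agrees with `np.linalg.norm(A, 2) ** 2`.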


Below are some comments regarding Theorem 2. 

Remark 6. In the next section, (2.24) will be established for $\hat{\Sigma}_n^{-1} = \hat{\Sigma}_n^{-1}(q_n)$, where $\hat{\Sigma}_n^{-1}(q_n)$,
defined in (3.4), is obtained by the $q_n$-banded Cholesky decomposition of $\Sigma_n^{-1}$ with the parameters
in the Cholesky factors estimated nonparametrically from the least squares residuals of an increasing
dimensional approximating model. As will be seen later, the order of the magnitude of $b_n$ associated
with $\|\hat{\Sigma}_n^{-1}(q_n) - \Sigma_n^{-1}\|$ can vary depending on the strength of the dependence of $\{e_t\}$.


Remark 7. Zhang, Wan and Zou (2013) considered the model averaging estimator $\tilde{\mu}_n(w) =
\sum_{m=1}^{M} w_m \tilde{\mu}_n(m)$ of $\mu_n$, where

$$\tilde{\mu}_n(m) = \tilde{P}_m Y_n \qquad (2.27)$$

is the estimator corresponding to the $m$th approximating model and $\tilde{P}_m$, an $n \times n$ matrix, depends
on $\{x_t\}$ only. They evaluated the performance of $\tilde{\mu}_n(w)$ using the usual squared error loss,

$$L_n(w) = \|\tilde{\mu}_n(w) - \mu_n\|^2,$$

and showed in their Theorem 2.1 that

$$\frac{L_n(\tilde{w}_n)}{\inf_{w \in \mathcal{G}_n} L_n(w)} \to_p 1, \qquad (2.28)$$

where $\tilde{w}_n$ is obtained from the JMA criterion (defined in equation (4) of their paper). While the
weight set in (2.28) is more general than that in (2.26), an assumption similar to (2.20) is required
in their proof of (2.28). In addition, (2.27), excluding all FGLS estimators (since $\hat{\Sigma}_n^{-1}$ depends on
both $\{x_t\}$ and $Y_n$), can suffer from lack of efficiency in estimating $\mu_n$.

Remark 8. When $e_t$ are independent random variables with $E(e_t) = 0$ for all $t$ and possibly
unequal $E(e_t^2) = \sigma_t^2$, Liu, Okui and Yoshimura (2013, Theorem 4) obtained a weaker version of
(2.26),

$$\frac{\hat{L}_n^*(\hat{w}_n^*)}{\inf_{w \in \mathcal{H}_n(N)} L_n^*(w)} \to_p 1,$$

in which $\hat{\Sigma}_n^{-1} = \mathrm{diag}(\hat{\sigma}_1^{-2}, \ldots, \hat{\sigma}_n^{-2})$ with $\hat{\sigma}_t^{-2}$ satisfying

$$\sup_{1 \le t \le n} (\hat{\sigma}_t^{-2} - \sigma_t^{-2})^2 = O_p(n^{-1}), \qquad (2.29)$$

among other conditions. However, since $n^{-1}$ is a parametric rate, certain parametric assumptions
on $\sigma_t^2$, $1 \le t \le n$, are required to ensure (2.29). In addition, their proof, relying crucially on Theorem 2
of Whittle (1960), is not directly applicable to dependent data.

Remark 9. Assumption (2.25) is a strengthened version of (2.16). It essentially says that the 
(normalized) estimation error of $\Sigma_n^{-1}$ must be dominated by the amount of information contained
in the candidate models in a certain way. This type of assumption seems indispensable for the 
FAMMA criterion to preserve the features of its infeasible counterpart. 

Remark 10. Throughout this paper, the only assumption that we impose on $\{x_t\}$ is $\sup_{t \ge 1, j \ge 1} E(|x_{tj}|^v)$
$< \infty$ for some $2 \le v < \infty$, in addition to the (a.s.) nonsingularity of $X_M$. Therefore, $\{x_t\}$ can be
nonrandom, serially independent or serially dependent. 

3 A consistent estimate of $\Sigma_n^{-1}$ based on the Cholesky decomposition

In this section, we shall construct a consistent estimator of $\Sigma_n^{-1}$ based on its banded Cholesky
decomposition. Note first that according to (2.12), (2.14) and (2.15), $e_t$ has an AR($\infty$) representation,

$$\sum_{j=0}^{\infty} a_j e_{t-j} = \alpha_t, \qquad (3.1)$$

where $a_0 = 1$, $\sum_{j=0}^{\infty} a_j z^j = (\sum_{j=0}^{\infty} \beta_j z^j)^{-1} \ne 0$ for all $|z| \le 1$ and $\sum_{j=0}^{\infty} |a_j| < \infty$; see Zygmund
(1959). If an AR($k$), $k \ge 1$, model is used to approximate model (3.1), then the corresponding best
(in the sense of mean squared error) AR coefficients are given by $-(a_1(k), \ldots, a_k(k))'$, where

$$(a_1(k), \ldots, a_k(k))' = \arg\min_{(c_1,\ldots,c_k)' \in \mathbb{R}^k} E(e_t + c_1 e_{t-1} + \cdots + c_k e_{t-k})^2.$$

Define $\sigma_k^2 = E(e_t + a_1(k)e_{t-1} + \cdots + a_k(k)e_{t-k})^2$. Then, the modified Cholesky decomposition for
$\Sigma_n^{-1}$ is

$$\Sigma_n^{-1} = T_n' D_n^{-1} T_n, \qquad (3.2)$$

where

$$D_n = \mathrm{diag}(\gamma_0, \sigma_1^2, \sigma_2^2, \ldots, \sigma_{n-1}^2),$$

and $T_n = (t_{ij})_{1 \le i,j \le n}$ is a lower triangular matrix satisfying

$$t_{ij} = \begin{cases} 0, & \text{if } i < j; \\ 1, & \text{if } i = j; \\ a_{i-j}(i-1), & \text{if } 2 \le i \le n,\ 1 \le j \le i-1. \end{cases}$$
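The decomposition (3.2) can be checked numerically. The sketch below assumes a stationary AR(1) error process with coefficient `rho` and unit innovation variance (a choice made only for illustration), computes the best AR($k$) coefficients from the autocovariances by solving the normal equations, assembles $T_n$ and $D_n$, and compares $T_n' D_n^{-1} T_n$ with the directly inverted covariance matrix. Function names are hypothetical.

```python
import numpy as np

def best_ar_coeffs(gamma, k):
    """Minimise E(e_t + c_1 e_{t-1} + ... + c_k e_{t-k})^2 given autocovariances gamma.
    Returns (a_1(k), ..., a_k(k)) and the minimum value sigma_k^2."""
    G = np.array([[gamma[abs(i - j)] for j in range(k)] for i in range(k)])
    g = gamma[1:k + 1]
    a = -np.linalg.solve(G, g)
    return a, gamma[0] + a @ g

def modified_cholesky(gamma, n):
    """T_n (unit lower triangular) and D_n (diagonal) as described around (3.2)."""
    T, d = np.eye(n), np.empty(n)
    d[0] = gamma[0]
    for i in range(1, n):            # 0-based row i is the paper's row i + 1
        a, d[i] = best_ar_coeffs(gamma, i)
        T[i, :i] = a[::-1]           # lag-1 coefficient sits next to the diagonal
    return T, np.diag(d)

# Assumed example: AR(1) errors, gamma_h = rho^h / (1 - rho^2).
n, rho = 8, 0.6
gamma = rho ** np.arange(n) / (1 - rho**2)
Sigma = gamma[np.abs(np.subtract.outer(np.arange(n), np.arange(n)))]

T, D = modified_cholesky(gamma, n)
precision = T.T @ np.linalg.inv(D) @ T
```

In this AR(1) case only the first subdiagonal of `T` is non-zero (equal to `-rho`) and every diagonal entry of `D` after the first equals the innovation variance, which is why banding loses nothing here.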


Since $T_n$ and $D_n$ may contain too many parameters as compared with $n$, we are led to consider a
banded Cholesky decomposition of $\Sigma_n^{-1}$,

$$\Sigma_n^{-1}(q) = T_n'(q) D_n^{-1}(q) T_n(q), \qquad (3.3)$$

where $1 \le q \ll n$ is referred to as the banding parameter,

$$D_n(q) = \mathrm{diag}(\gamma_0, \sigma_1^2, \ldots, \sigma_q^2, \ldots, \sigma_q^2),$$

and $T_n(q) = (t_{ij}(q))_{1 \le i,j \le n}$ with

$$t_{ij}(q) = \begin{cases} 0, & \text{if } i < j \text{ or } \{q+1 < i \le n,\ 1 \le j \le i-q-1\}; \\ 1, & \text{if } i = j; \\ a_{i-j}(i-1), & \text{if } 2 \le i \le q,\ 1 \le j \le i-1; \\ a_{i-j}(q), & \text{if } q+1 \le i \le n,\ i-q \le j \le i-1. \end{cases}$$
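To see what banding does when the exact inverse is not banded, the following sketch builds $\Sigma_n^{-1}(q)$ from known autocovariances for an MA(1) error process $e_t = \alpha_t + b\,\alpha_{t-1}$ and records the spectral-norm distance to the exact inverse for several $q$. The MA(1) choice, the value of `b` and the function names are assumptions made for this example.

```python
import numpy as np

def ar_fit(gamma, k):
    """Best AR(k) coefficients a(k) and residual variance sigma_k^2 from autocovariances."""
    G = np.array([[gamma[abs(i - j)] for j in range(k)] for i in range(k)])
    a = -np.linalg.solve(G, gamma[1:k + 1])
    return a, gamma[0] + a @ gamma[1:k + 1]

def banded_precision(gamma, n, q):
    """T_n(q)' D_n(q)^{-1} T_n(q): rows beyond q reuse the order-q coefficients."""
    T, d = np.eye(n), np.empty(n)
    d[0] = gamma[0]
    for i in range(1, n):
        k = min(i, q)
        a, d[i] = ar_fit(gamma, k)
        T[i, i - k:i] = a[::-1]
    return T.T @ np.diag(1.0 / d) @ T

# Assumed example: MA(1) errors with unit innovation variance.
n, b = 30, 0.5
gamma = np.zeros(n)
gamma[0], gamma[1] = 1 + b**2, b
Sigma = gamma[np.abs(np.subtract.outer(np.arange(n), np.arange(n)))]
exact = np.linalg.inv(Sigma)

errors = {q: np.linalg.norm(banded_precision(gamma, n, q) - exact, 2)
          for q in (1, 2, 4, 8)}
```

Because the AR($\infty$) coefficients of this process decay geometrically, the approximation error is expected to shrink quickly as `q` grows, and taking `q = n - 1` recovers the unbanded decomposition.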


To estimate the banded Cholesky factors in (3.3), we first generate the least squares residuals
$\hat{e}_n = (\hat{e}_1, \ldots, \hat{e}_n)'$ based on the approximating model $\sum_{j=1}^{d} \theta_j x_{tj}$ for (2.1), where $d = d_n$ is allowed
to grow to infinity with $n$ and $\hat{e}_n = (I - H_d)Y_n$ with $H_d$ denoting the orthogonal projection matrix
for the column space of $X(d) = (x_{ij})_{1 \le i \le n,\, 1 \le j \le d}$. Having obtained $\hat{e}_n$, the $\gamma_0$ and $\sigma_k^2$ in $D_n(q)$
and the $a_i(k)$ in $T_n(q)$ can be estimated by $\hat{\gamma}_0$, $\hat{\sigma}_k^2$, and $\hat{a}_i(k)$, respectively, where for $1 \le k \le q$,

$$(\hat{a}_1(k), \ldots, \hat{a}_k(k))' = \arg\min_{(c_1,\ldots,c_k)' \in \mathbb{R}^k} \sum_{t=q+1}^{n} (\hat{e}_t + c_1 \hat{e}_{t-1} + \cdots + c_k \hat{e}_{t-k})^2,$$

$$\hat{\gamma}_0 = n^{-1}\sum_{t=1}^{n} \hat{e}_t^2, \qquad \hat{\sigma}_k^2 = (n-q)^{-1}\sum_{t=q+1}^{n} \Big(\hat{e}_t + \sum_{j=1}^{k} \hat{a}_j(k)\hat{e}_{t-j}\Big)^2.$$

Plugging these estimators into $D_n(q)$ and $T_n(q)$, we obtain $\hat{D}_n(q)$ and $\hat{T}_n(q)$, and hence an estimator
of $\Sigma_n^{-1}$,

$$\hat{\Sigma}_n^{-1}(q) = \hat{T}_n'(q)\hat{D}_n^{-1}(q)\hat{T}_n(q). \qquad (3.4)$$

Note that we have suppressed the dependence of $\hat{\Sigma}_n^{-1}(q)$ on $d$ in order to simplify notation. The
next theorem provides a rate of convergence of $\hat{\Sigma}_n^{-1}(q)$ to $\Sigma_n^{-1}$ when $q = q_n$ and $d = d_n$ grow to
infinity with $n$ at suitable rates.
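The estimation steps leading to (3.4) can be sketched as follows. This is only an illustration under stated assumptions: the residual-based formulas are implemented as read above (least squares AR fits over $t = q+1, \ldots, n$, with the residual variance averaged over the same range), the simulated AR(1) errors, regressors and tuning values `d` and `q` are hypothetical, and the function name is not from the paper.

```python
import numpy as np

def estimate_banded_precision(Y, X, d, q):
    """Plug-in estimate T_hat(q)' D_hat(q)^{-1} T_hat(q) from least squares residuals."""
    n = len(Y)
    Xd = X[:, :d]
    resid = Y - Xd @ np.linalg.lstsq(Xd, Y, rcond=None)[0]    # (I - H_d) Y
    fits = {}
    for k in range(1, q + 1):
        # Column j holds e_{t-j} for t = q+1, ..., n (1-based), j = 1, ..., k.
        Z = np.column_stack([resid[q - j:n - j] for j in range(1, k + 1)])
        target = resid[q:]
        a = -np.linalg.lstsq(Z, target, rcond=None)[0]        # a_hat(k)
        fits[k] = (a, np.mean((target + Z @ a) ** 2))         # sigma_hat_k^2
    T, dvec = np.eye(n), np.empty(n)
    dvec[0] = resid @ resid / n                               # gamma_hat_0
    for i in range(1, n):
        a, dvec[i] = fits[min(i, q)]
        T[i, i - len(a):i] = a[::-1]
    return T.T @ np.diag(1.0 / dvec) @ T

# Hypothetical data: regression with AR(1) errors.
rng = np.random.default_rng(3)
n, rho = 400, 0.6
innov = rng.standard_normal(n)
e = np.empty(n)
e[0] = innov[0] / np.sqrt(1 - rho**2)
for t in range(1, n):
    e[t] = rho * e[t - 1] + innov[t]
X = rng.standard_normal((n, 5))
Y = X @ np.array([1.0, 0.5, 0.25, 0.125, 0.0625]) + e

P_hat = estimate_banded_precision(Y, X, d=3, q=2)
```

One design point worth noting: because $\hat{T}_n(q)$ is unit lower triangular and the diagonal of $\hat{D}_n(q)$ is positive, the resulting estimate is symmetric positive definite by construction, which is convenient when it is plugged into FGLS.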
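Theorem 3. Assume Assumptions 1 and 2 with (2.13) and (2.15) replaced by

$$E(|\alpha_t|^{2r} \mid \mathcal{F}_{t-1}) < C_{2r} < \infty \ \text{a.s.}, \tag{3.5}$$

where $r > 2$, and

$$\sum_{j \ge 1} \cdots \tag{3.6}$$

respectively. Also assume that

$$\sup E(|x_{tj}|^{2r}) < \infty. \tag{3.7}$$

Suppose that $d_n$ and $q_n$ are chosen to satisfy:

$$d_n \asymp n^{1/4}, \tag{3.8}$$

$$\max\{d_n, q_n\} \sum_{j > d_n} |\theta_j| = o(1), \tag{3.9}$$

and

$$\frac{q_n^2 d_n}{n} = o(1). \tag{3.10}$$

Then,

$$\|\hat\Sigma_n^{-1}(q_n) - \Sigma_n^{-1}\| = O_p\left( \frac{q_n^{1+r^{-1}}}{n^{1/2}} + \sqrt{\sum_{j \ge q_n+1} |a_j| \sum_{j \ge q_n+1} j|a_j|} \right). \tag{3.11}$$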

Remark 11. In the simpler situation where $\hat{e}_n = Y_n = e_n$, namely, $\mu_t = 0$ for all $t$, Wu and Pourahmadi (2009) proposed a banded covariance matrix estimator $\hat\Sigma_{n,l} = (\hat\gamma_{i-j} 1_{|i-j| \le l})_{1 \le i,j \le n}$ of $\Sigma_n$, where $\hat\gamma_k = n^{-1} \sum_{i=1}^{n-|k|} e_i e_{i+|k|}$ is the $k$th lag sample ACF of $\{e_t\}$ and $l$ is also called the banding parameter. When $l = l_n = o(n^{1/2})$ and (3.5) holds with $r = 2$, their Theorems 2 and 3 imply that $\hat\Sigma_{n,l_n}$ is positive definite with probability approaching one,

$$\|\hat\Sigma_{n,l_n} - \Sigma_n\| = O_p\left( \frac{l_n}{n^{1/2}} + \sum_{j > l_n} |\gamma_j| \right) \tag{3.12}$$

and 

. (3.13) 

McMurry and Politis (2010) generalized (3.12) and (3.13) to tapered covariance matrix estimators. Ing, Chiou and Guo (2013) considered estimating $\Sigma_n^{-1}$ through the banded Cholesky decomposition approach in situations where $\hat{e}_n$ is obtained by a correctly specified regression model. They established the consistency of the proposed estimator under spectral norm, even when $\{e_t\}$ is a long-memory time series. However, since this section allows the regression model to be misspecified, all the aforementioned results are not directly applicable here.
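For comparison with the Cholesky-based approach, the banded sample autocovariance matrix described in Remark 11 can be sketched as below. The function name `banded_autocov_matrix` and the simulated white-noise input are assumptions for illustration only; the sketch simply keeps the sample autocovariances at lags up to $l$ and sets the remaining entries to zero.

```python
import numpy as np

def banded_autocov_matrix(e, l):
    """Matrix with (i, j) entry gamma_hat_{|i-j|} if |i - j| <= l, else 0."""
    e = np.asarray(e, dtype=float)
    n = e.size
    # gamma_hat_k = n^{-1} * sum_{i=1}^{n-k} e_i e_{i+k}, for k = 0, ..., l
    gamma_hat = np.array([e[:n - k] @ e[k:] / n for k in range(l + 1)])
    lags = np.abs(np.subtract.outer(np.arange(n), np.arange(n)))
    return np.where(lags <= l, gamma_hat[np.minimum(lags, l)], 0.0)

# Hypothetical input: 200 simulated observations.
rng = np.random.default_rng(1)
e = rng.standard_normal(200)
S = banded_autocov_matrix(e, l=5)
```

Unlike the estimator in (3.4), a matrix built this way is not positive definite by construction, which is why the cited results state positive definiteness only with probability approaching one.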

Remark 12. The second term on the right-hand side of (3.11) is mainly contributed by the approximation error $\|\Sigma_n^{-1}(q_n) - \Sigma_n^{-1}\|$, whereas the first one is mainly due to the sampling variability $\|\hat\Sigma_n^{-1}(q_n) - \Sigma_n^{-1}(q_n)\|$, which is in turn dominated by $\|\hat{T}_n(q_n) - T_n(q_n)\|$, as shown in the proof of Theorem 3. Similarly, the first and second terms on the right-hand side of (3.12) are contributed by $\|\hat\Sigma_{n,l_n} - \Sigma_{n,l_n}\|$ and $\|\Sigma_{n,l_n} - \Sigma_n\|$, respectively. Here, $\Sigma_{n,l_n} = (\gamma_{i-j} 1_{|i-j| \le l_n})_{1 \le i,j \le n}$ is the population version of $\hat\Sigma_{n,l_n}$. However, unlike $\hat\Sigma_{n,l_n} - \Sigma_{n,l_n}$, $\hat{T}_n(q_n) - T_n(q_n)$ is not a Toeplitz matrix. Hence our upper bound for $\|\hat{T}_n(q_n) - T_n(q_n)\|$ is derived from complicated maximal probability inequalities, such as (A.42) and (A.49), which also lead to an additional exponent $r^{-1}$ in the first term on the right-hand side of (3.11).

Remark 13. The technical assumptions (3.8)-(3.10) essentially say that the dimension, $d_n$, of the working regression model should be neither too large nor too small. They ensure that the sampling variability and the approximation error introduced by this model are completely absorbed into the first or second term on the right-hand side of (3.11), which depend only on the working AR model used in the Cholesky decomposition. As shown in the next section, this feature can substantially reduce the burden of verifying (2.24) and (2.25).


4 Asymptotic efficiency of the FAMMA method with $\hat\Sigma_n^{-1} = \hat\Sigma_n^{-1}(q_n)$

In this section, we shall establish the asymptotic efficiency of $\hat{C}_n^*(w)$ with $\hat\Sigma_n^{-1} = \hat\Sigma_n^{-1}(q_n)$, denoted by $\hat{C}_{n,q_n}^*(w)$, when the AR coefficients of $\{e_t\}$ satisfy


or 



< Ciexp(-vq), 



< C’zq 


(4.1) 


(4.2) 


for all $q \ge 1$ and some positive constants $C_1$, $C_2$ and $\nu$. We call (4.1) the exponential decay case, which is fulfilled by any causal and invertible ARMA($p$, $q$) model with $0 \le p, q < \infty$. On the other hand, (4.2) is referred to as the algebraic decay case, which is commonly discussed in the context of model selection for time series; see Shibata (1981) and Ing and Wei (2003, 2005).

We first choose suitable $q_n$ for $\hat\Sigma_n^{-1}(q_n)$ to ensure that the bound in (3.11) possesses the optimal rate. When (4.1) is assumed, it is not difficult to see that the optimal rate of (3.11) is $O_p((\log n)^{1+r^{-1}}/n^{1/2})$, which is achieved by

$$q_n = c_4 \log n, \tag{4.3}$$

for some sufficiently large constant $c_4$. Therefore, (2.24) holds with

$$\hat\Sigma_n^{-1} = \hat\Sigma_n^{-1}(c_4 \log n) \quad \text{and} \quad b_n = (\log n)^{1+r^{-1}}. \tag{4.4}$$

When (4.2) is true, by letting

$$q_n = \lfloor n^{1/\{2(1+r^{-1}+\nu)\}} \rfloor, \tag{4.5}$$

where $\lfloor a \rfloor$ denotes the largest integer $\le a$, we get the optimal rate of (3.11), $O_p(n^{-\nu/\{2(1+\nu+r^{-1})\}})$, yielding that (2.24) holds with

$$\hat\Sigma_n^{-1} = \hat\Sigma_n^{-1}(\lfloor n^{1/\{2(1+r^{-1}+\nu)\}} \rfloor) \quad \text{and} \quad b_n = n^{(1+r^{-1})/\{2(1+\nu+r^{-1})\}}. \tag{4.6}$$
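The choice of $q_n$ in the algebraic decay case balances the two terms of the bound in (3.11). The short sketch below checks this numerically. The function names are hypothetical, the approximation term is taken to be $q^{-\nu}$ (the right-hand side of (4.2) with the constant set to one, an assumption for illustration), and rounding $c_4 \log n$ down to an integer in the exponential case is likewise an assumption made here.

```python
import math

def banding_parameter_algebraic(n, r, nu):
    """q_n of (4.5): floor(n ** (1 / (2 * (1 + 1/r + nu))))."""
    return math.floor(n ** (1.0 / (2.0 * (1.0 + 1.0 / r + nu))))

def banding_parameter_exponential(n, c4):
    """q_n of (4.3), rounded down so it can be used as a band width."""
    return math.floor(c4 * math.log(n))

def rate_terms(n, q, r, nu):
    """Sampling term q^(1 + 1/r) / sqrt(n) and algebraic bound q^(-nu)."""
    return q ** (1.0 + 1.0 / r) / math.sqrt(n), q ** (-nu)

# Hypothetical values of n, r and nu.
n, r, nu = 1000, 4, 1.0
q_n = banding_parameter_algebraic(n, r, nu)
q_real = n ** (1.0 / (2.0 * (1.0 + 1.0 / r + nu)))  # unrounded balance point
sampling, approx = rate_terms(n, q_real, r, nu)
optimal_rate = n ** (-nu / (2.0 * (1.0 + nu + 1.0 / r)))
```

At the unrounded balance point the two terms coincide and both equal $n^{-\nu/\{2(1+\nu+r^{-1})\}}$, which is the rate quoted after (4.5).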

We are ready to establish the asymptotic efficiency of $\hat{C}_{n,q_n}^*(w)$ under (4.1).

Corollary 1. Assume Assumptions 1 and 2, (4.1) and (3.7) with $2r$ replaced by $S$, noting that $S$ is defined in (M3) of Assumption 1. Suppose that $d_n$ and $q_n$ obey (3.8) and (4.3), respectively. Moreover, assume

$$d_n \sum_{j > d_n} |\theta_j| = o(1), \tag{4.7}$$

and for some $2N/S < \theta < 1/2$,

$$\frac{(\log n)^{1+(2/S)}}{k_n^{*(1/2)-\theta}} = o(1) \ \text{a.s.} \tag{4.8}$$

Then, (2.26) holds with $\hat{w}_n = \arg\min_{w \in \mathcal{H}_N} \hat{C}_{n,q_n}^*(w)$.

Corollary 1 follows directly from Theorems 2 and 3 and (4.4) with $r^{-1}$ replaced by $2/S$. Its proof is thus omitted. Condition (4.8) is easily satisfied when $D_n(m)$ follows (2.21). To see this, note that (2.21) implies $k_n^* = c_5 n^{1/(1+\alpha)}$ for some $c_5 > 0$. Therefore, (4.8) holds for any $2N/S < \theta < 1/2$. On the other hand, it is not difficult to show that (4.8) is violated when $D_n(m) = n\exp(-c_6 m) + m$ for some $c_6 > 0$, which leads to a much smaller $k_n^* = c_7 \log n$ for some $c_7 > 0$.

To establish the asymptotic efficiency of $\hat{C}_{n,q_n}^*(w)$ under (4.2) with $\nu$ unknown, we need to assume that $\nu$ has a known lower limit $\nu_0 > 1/3$.

Corollary 2. Assume Assumptions 1 and 2, (4.2) and (3.7) with $2r$ replaced by $S$. Suppose that $d_n$ obeys (3.8) and $q_n$ satisfies (4.5) with $r^{-1}$ replaced by $2/S$ and $\nu$ by $\nu_0$. Moreover, assume (3.9) and for some $2N/S < \theta < 1/2$,

$$\frac{n^{\{1+(2/S)\}/\{2[1+\nu_0+(2/S)]\}}}{k_n^{*(1/2)-\theta}} = o(1) \ \text{a.s.} \tag{4.9}$$

Then, the conclusion of Corollary 1 follows.

Corollary 2 can be proved using Theorems 2 and 3 and (4.6) with $r^{-1}$ and $\nu$ replaced by $2/S$ and $\nu_0$, respectively. We again omit the details. Before closing this section, we provide a sufficient condition for (4.9) in situations where $D_n(m)$ obeys (2.21). We assume that (2.13) in Assumption 1 holds for any $0 < S < \infty$ in order to simplify exposition. Elementary calculations show that (4.9) follows from $\nu_0 > \alpha$. However, since $\alpha$ is in general unknown, our simple and practical guidance for verifying (4.9) is to check whether $\nu_0 > \bar\alpha$, where $\bar\alpha$ is a known upper bound for $\alpha$.
5 Concluding remarks


This paper provides guidance for the model averaging implementation in regression models with 
time series errors. Driven by the efficiency improvement, our goal is to choose the optimal weight 
vector that averages across FGLS estimators obtained from a set of approximating models of the 
true regression function. We propose the FAMMA as the weight selection criterion and show its 
asymptotic optimality in the sense of (2.26). To the best of our knowledge, it is the first time that 
the FGLS-based criterion is proved to have this type of property in the presence of time-dependent 
errors. 

On the other hand, our asymptotic optimality, implicitly involving the search for the averaging
estimator whose loss (or conditional risk) has the best constant in addition to the best rate, is typically
not achievable when the number of candidate models is large and the models are not necessarily
nested. While Wan, Zhang and Zou (2010) and Zhang, Wan and Zou (2013) proved the asymptotic
efficiency of their averaging estimators without assuming nested candidate models, a stringent
condition on the number of models, e.g., (2.20), is placed as the tradeoff. Furthermore, on top of
their positive report, no clear guideline for the optimal averaging across arbitrary combinations of 
regressors was offered. In fact, in this more challenging situation, pursuing the minimax optimal 
rate appears to be more relevant than the asymptotic efficiency. The theoretical results developed 
in Wang et al. (2014) and in Sections 2 and 3 provide useful tools for deriving the minimax optimal 
rate under model (2.1). Moreover, motivated by Ing and Lai (2011), we conjecture that when the 
variables are preordered by the orthogonal greedy algorithm (OGA) (see, e.g., Temlyakov (2000) 
and Ing and Lai (2011)), this rate is achievable by FAMMA with 2 replaced by a factor directly 
proportional to the natural logarithm of the number of candidate models. We leave investigations 
along this research direction to future work. 

APPENDIX 

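Proof of Lemma 1. Note first that $R_n^*(w) = E_x(L_n^*(w)) = E_x(e_n' P^{*\prime}(w) \Sigma_n^{-1} P^*(w) e_n) + \mu_n'(I - P^*(w))' \Sigma_n^{-1} (I - P^*(w)) \mu_n$. Since

$$\Sigma_n^{-1/2} P^*(w) = \sum_{m=1}^{M} w_m P_m \Sigma_n^{-1/2}, \tag{A.1}$$

it follows that

$$E_x(e_n' P^{*\prime}(w) \Sigma_n^{-1} P^*(w) e_n) = E_x\Big( \sum_{m=1}^{M} \sum_{l=1}^{M} w_m w_l e_n' \Sigma_n^{-1/2} P_l P_m \Sigma_n^{-1/2} e_n \Big) = \sum_{m=1}^{M} \sum_{l=1}^{M} w_l w_m \min\{k_l, k_m\}.$$

Similarly, $\mu_n'(I - P^*(w))' \Sigma_n^{-1} (I - P^*(w)) \mu_n = \sum_{m=1}^{M} \sum_{l=1}^{M} w_m w_l \mu_n' \Sigma_n^{-1/2} (I - P_{\max\{m,l\}}) \Sigma_n^{-1/2} \mu_n$. Consequently, the desired conclusion (2.2) follows.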

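Proof of Theorem 1. Define $w_n^* = \arg\min_{w \in \mathcal{H}_N} L_n^*(w)$, $(\hat{w}_{n,1}, \ldots, \hat{w}_{n,M})' = \hat{w}_n$, and $(w_{n,1}^*, \ldots, w_{n,M}^*)' = w_n^*$. By noticing

$$C_n^*(w) - L_n^*(w) = e_n' \Sigma_n^{-1} e_n + 2 e_n' \Sigma_n^{-1} (I - P^*(w)) \mu_n - 2\Big\{ e_n' \Sigma_n^{-1} P^*(w) e_n - \sum_{m=1}^{M} w_m k_m \Big\},$$

we get

$$0 \ge C_n^*(\hat{w}_n) - C_n^*(w_n^*) = L_n^*(\hat{w}_n) - L_n^*(w_n^*) + 2 e_n' \Sigma_n^{-1} (I - P^*(\hat{w}_n)) \mu_n - 2\Big\{ e_n' \Sigma_n^{-1} P^*(\hat{w}_n) e_n - \sum_{m=1}^{M} \hat{w}_{n,m} k_m \Big\} - 2 e_n' \Sigma_n^{-1} (I - P^*(w_n^*)) \mu_n + 2\Big\{ e_n' \Sigma_n^{-1} P^*(w_n^*) e_n - \sum_{m=1}^{M} w_{n,m}^* k_m \Big\}$$

$$= L_n^*(\hat{w}_n) - L_n^*(w_n^*) + 2A_n(\hat{w}_n) - 2B_n(\hat{w}_n) - 2A_n(w_n^*) + 2B_n(w_n^*), \tag{A.2}$$

where $A_n(w) = e_n' \Sigma_n^{-1} (I - P^*(w)) \mu_n$ and $B_n(w) = e_n' \Sigma_n^{-1} P^*(w) e_n - \sum_{m=1}^{M} w_m k_m$. In view of (A.2) and $L_n^*(\hat{w}_n) \ge L_n^*(w_n^*)$, it suffices for (2.11) to show that

$$\sup_{w \in \mathcal{H}_N} \left| \frac{A_n(w)}{R_n^*(w)} \right| = o_p(1), \tag{A.3}$$

$$\sup_{w \in \mathcal{H}_N} \left| \frac{B_n(w)}{R_n^*(w)} \right| = o_p(1), \tag{A.4}$$

and

$$\sup_{w \in \mathcal{H}_N} \left| \frac{L_n^*(w)}{R_n^*(w)} - 1 \right| = o_p(1), \tag{A.5}$$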


where ——» denotes convergence in probability. 


To show (A.3), first note that 

«(0 = 


u 


B-ji, -,jii 


where for 1 < ,)i < ■ ■ ■ < ji < M, = {w : w E P(i) and Uj i / 0, 1 < i < l}. Hence for any 

e > 0 , 


P x sup 
V w£Hn 


A n (w) 


< 


< 


K ( W ) 

N M 32 -i 

P, 


> £ 


EE- 

-E 

i=l 31=1 

A=i 

N M 

12—1 

EE- 

i=l 31=1 

"X 

11=1 


sup 

w£_V. 


31’-~ ,31 


A n (w) 


> £ 


E. 


- ,n} 


K( w ) 

/ x' n X- 1 /2 (I _P m ) S -l/2 en 


N 


5 2 max D n (m) 
™e{ji>-,ii} 


> e = (A.6) 


l=i 
where $P_x(\cdot) = P(\cdot \mid x_1, \ldots, x_n)$ and the second inequality follows from


w£H 


inf P*(w) > S 2 max D n {m ), 


n,-di me{ji,- ,ji} 


(A.T) 


which is ensured by Lemma 1 and the definition of $\mathcal{H}_N$. Let $S_1 = S/2$. Then, by Chebyshev's
inequality, (M3), and Lemma 2 of Wei (1987), it holds that


P,r 


y 

/—JV 


,ij} 






\ 


5 2 max D n (m) 
" ,jl} 


> £ 


< 


< 


< 


° E 

c E 

c E 

me{j l,- ,ji} 


\ 


E x {^- 1/2 (I - P m )£~ 1/2 e TO } g i 

{ max D n (m)} Sl 
rne{ji,- ,ji} 


(^E n 1 / 2 (/-P m )E n 1 / 2 /ln ) Sl/2 


max D n (m ) 
rn£{ji,-,ji} 

1 


Si 


< c 


max D n (m ) 

me{j l,— jt} 


51/2 (^(ji)) Sl/2 ’ 


where here and hereafter C denotes a generic positive constant whose value is independent of n and 
may vary at different occurrences. Therefore, for each 1 < l < N, 


32 1 


q, < c\y - y 


1 


M 


32 1 

+ E -E 

31 = 


1 


jl=l jl=l ( k n) Sl/2 ji=k *+1 ji=l ( D n(jl )) Sl/2 




k: 


-(S-l/2-I) 


+ E 


OO 4-1 

h 


■Si/2 


, 


which converges to 0 a.s. in view of (2.16). As a result, 

A n (w) 


sup 

. wGHn 


R* n (w) 


£ I —^ 0, a.s. 


This and the dominated convergence theorem together imply (A.3). 
Similarly, 


P x sup 
V wGHn 


BJw) 


RZ(w) 


N 


>e) <C^E h 


where 


M J2 — 1 


^ = E-E J> . 

31=1 j i=l 


E 


m£{n,- . 1 ;} 


Z=1 


' y-l/2p y-1/2 


&“ max D n (m ) 

wieO'i,- >/(} 


> e 


By (M3) and the first moment bound theorem of Findley and Wei (1993), it follows that 


< 


P,- 


y 

z—/ri 


■6 0 ' 1 ,— JO 


'v-i/Sp y- l / 2 e 
1 m^n 


\ 


6f max D n (m) 
me{ji,— ,ji} 


c E 

~ ,ji} 


k sl/2 

hjm 


max D n (m ) 
,j t } 


sl< 



Cl 

( DnUl)) Sl/ 2 ‘ 


Therefore, (A.4) follows immediately from an argument similar to that used to prove (A.3). The 
proof of (A.5) is similar to those of (A.3) and (A.4). The details are omitted. 


Proof of Theorem 2. Define L* ( w) = (p* (w) — fi n )' S n 1 [fi* n (w) — /z„). Then, it follows that 


M 


= e, 


C*(w) = (e n -(e n -{fi* n (w) - ti n ))+ 2^2w m k m 

m= 1 
M 

3 n^n e n + ^n( w ) 2 — fi n ) S n e n + 2 ^ 

m=l 

, f M ^ 

+ 2/4 (/ - P*(wj) + L;(m) - 2 j <P*'(«,)£« ^ - £ [ , 

l m= 1 J 


and hence 

o > c < :(th n )-c':(^) 

= L* n {w n ) - L* n (w^) + 2A n (m n ) - 2A n (w„) - 2 B n (w n ) + 2 B n {w^) 

= (P*n(™n) - L%(w n )) - (.L*(to£) - L*(w*)) + 2 i n (m n ) - 2 A n (w^) 
- 2B n (m n ) + 2 B n (wf x ) + {L%(w n ) - L%(w%)), 


where = arg min^g^ L%(w), A n (w ) = (j,' n (I - P*(w 


E n 1 e n and B n [w) = e' n P* (m)E 


-l 


Gri 


Em=l w mkm■ Since L^(w n ) > L^(w£) and (A.3)-(A.5) hold under the assumptions of Theorem 
2 , it suffices for (2.26) to show that 


sup 

w£Hi v 


L*(m) - L*(w) 
R*{w) 


Op(l), 


(A. 8 ) 


sup 

WSlHn 


A n {w) - A n (m) 
R*{w) 


Op(l)j 


(A.9) 


sup 

wSlHn 


B n (w) - £ n (m) 
R*{w) 


Op(l)) 


(A.10) 
and


sup 

w£Hi v 


L n(w)~L* n (w) 


= Op( 1 ). 


To prove (A.9), note first that 

An(w)-A n (w) < n' n (I - P*(w))' (E” 1 - S'^en 

+ \n' n (P*(w) - P*(w)) / (S~ 1 - S“ 1 )e n | + | v' n (P*(w) - 
= (1) + (2) + (3). 

Assumption 2 implies 

sup HE" 1 !! < oo and sup ||E„|| < oo, 

n>1 n>1 


(A.ll) 


(A.12) 

(A.13) 


which, together with (2.24), gives 


(1) < CUE- 1 - E” 1 !!||e n || ||E~ 1//2 (A - P>)K 


1/2 , 


= O p (b n )R* n (w), 


where the O p (b n ) term is independent of w. In view of this, (A.7) and (2.25), one obtains 


p' n {i-p*(w)Y( s-i-s- 1 ^ 


R^w) 


sup 

wG'Hn 


= max max sup 

l<KiV l<ji<—<ji<M , J( 

= O p (b n ) * 1/2 = o p (l). 


^(/-^(^/(S^-E-^e, 




(A. 14) 


Let A m = A m E n 1 A m and A m = A m E n 1 A m . Then, straightforward calculations yield for any 
w G W-jx,- .ji with 1 < ji < ■ ■ ■ < ji < M and 1 < l < N, 

(3) < Y K (Z-'X^A- 1 - A^)X' m + (E- 1 - T 1 - 1 )X m A^X' m ) E“ 1 e ri | . (A.15) 

It follows from (A.13) and (2.24) that 

/ —l r-i\ V a—i v> w —1- 


E 


m'JS - 1 - E- i )A m A- i A^E- i e, 


O p {b n ) Y P ™E n - 1/2 e t 


In addition, (A.7) and the first moment bound theorem of Findley and Wei (1993) imply that for 

Si = S/2 > N/0, 


< 


Pt max max 

sup 

l 1<1<N i<ji<-<ji<M w& n h ,. 

N M 

J2 -1 

r . h* Sl/2 h*~ SlB W • • 

■E- 

i= i ji=i 

ji=i j 

n , *Sl/2 ; *-Si9 Si/2+JV 

C ■ k n k n k n 




P E _1 ^ 2 e 

1 m^n 

K 

» 


> k. 




l 


->Si /2 / . 
Combining the above two equations with (2.25) and the dominated convergence theorem, we get


X 


max max sup 

!<Z<iV 1 <ji<-<ji<M w& 'Ui 


me{j 1,- ,ji} 


/^(E - 1 - X-^XnA-'X^e, 


,31 

Some algebraic manipulations yield 


R*(w) 


= Op( 1). (A.16) 


Pn^n A^m(A m A m )X m Ti n e n — Op(b n ) y 

,ji} rnG{j!,-,ji} 


P E _1 / 2 e 

1 m^n 


Therefore, by an argument similar to that used to prove (A. 16), 

PnK'XmiA-' - A~l)X' m E-V 


max max sup 

1<1<N l<ji<—<ji<M we'U 


31, "• ,31 


R*(w) 

We conclude from (A. 15), (A. 16) and (A. 17) that 

»’ n (P*(w)-P*(w))' S-'e 

sup - 

w£Hn 


= Op( 1). (A.17) 


R* n (w) 


= o p ( 1). 


(A.18) 


Finally, straightforward calculations and (2.24) yield that for any w £ 'Hj 1 ,— j l with 1 < j\ < 
< jl < M and 1 < l < IV, 

(2) < Ip'n^XmiA- 1 - A£)X' m ( E- 1 - E“ 1 )e 

j'd 

+ ]T I^UE - 1 - E- 1 )X m (A - 1 - A" 1 )A^E - 1 - E“ 1 )e, 

Ji} 

= O p {bl). 


This and (2.25) imply 

n' n {P*{w)-F*{w)y{ S-'-S-^e, 


sup 

w£'H n 


= Op(b 2 Jk * n ) = Op(l). 


72* (m) 


(A.19) 


Now the desired conclusion (A.9) follows from (A.12), (A.14), (A.18) and (A.19). The proofs of
(A.8), (A.10) and (A.11) are similar to that of (A.9). The details are thus skipped.

Before proving Theorem 3, we need an auxiliary lemma. 

Lemma 2. Assume (2.12), (2.14) and (3.6). Then for any $1 \le q \le n-1$,

$$\|\Sigma_n^{-1}(q) - \Sigma_n^{-1}\| \le C \sqrt{\sum_{j \ge q+1} |a_j| \sum_{j \ge q+1} j|a_j|}, \tag{A.20}$$

where the $a_j$'s are defined as in (3.1).
Proof. It follows from (2.12), (2.14), (3.6) and Theorem 3.8.4 of Brillinger (1975) that

$$\sum_{j \ge 1} j|a_j| < \infty. \tag{A.21}$$

In view of (3.2) and (3.3), one has

$$\|\Sigma_n^{-1}(q) - \Sigma_n^{-1}\| \le \|T_n - T_n(q)\| \|D_n^{-1}\| \|T_n\| + \|T_n(q)\| \|D_n^{-1}(q) - D_n^{-1}\| \|T_n\| + \|T_n(q)\| \|D_n^{-1}(q)\| \|T_n - T_n(q)\|. \tag{A.22}$$

It is easy to see that

$$\|D_n^{-1}(q)\| \le C \quad \text{and} \quad \|D_n^{-1}\| \le C. \tag{A.23}$$

Moreover, by (A.13) and (A.23),

$$\|T_n\|^2 \le \|T_n' D_n^{-1} T_n\| \|D_n\| = \gamma_0 \|\Sigma_n^{-1}\| \le C, \tag{A.24}$$

and

$$\|D_n^{-1}(q) - D_n^{-1}\| \le \|D_n^{-1}\| \|D_n^{-1}(q)\| \|D_n(q) - D_n\| \le C(\sigma_q^2 - \sigma^2) \le C \sum_{j \ge q+1} a_j^2. \tag{A.25}$$

According to (A.21)-(A.25), it remains to prove that

$$\|T_n(q) - T_n\|^2 \le C \sum_{j \ge q+1} |a_j| \sum_{j \ge q+1} j|a_j|. \tag{A.26}$$

By making use of Theorem 2.2 of Baxter (1962), it can be shown that

$$\|T_n(q) - T_n\|_\infty \le C \sum_{j \ge q+1} |a_j|, \tag{A.27}$$

and

$$\|T_n(q) - T_n\|_1 \le C \sum_{j \ge q+1} j|a_j|, \tag{A.28}$$

where for an $m \times n$ matrix $B = (b_{ij})_{1 \le i \le m, 1 \le j \le n}$, $\|B\|_1 = \max_{1 \le j \le n} \sum_{i=1}^{m} |b_{ij}|$ and $\|B\|_\infty = \max_{1 \le i \le m} \sum_{j=1}^{n} |b_{ij}|$. The desired conclusion (A.26) now follows from (A.27), (A.28) and $\|T_n(q) - T_n\|^2 \le \|T_n(q) - T_n\|_1 \|T_n(q) - T_n\|_\infty$.
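The last step of the proof rests on the standard inequality $\|B\|^2 \le \|B\|_1 \|B\|_\infty$ between the spectral norm and the maximum column-sum and row-sum norms. A quick numerical check, on an arbitrary matrix chosen here purely for illustration, is:

```python
import numpy as np

# Arbitrary test matrix; any real matrix satisfies the inequality.
rng = np.random.default_rng(2)
B = rng.standard_normal((6, 6))

spec = np.linalg.norm(B, 2)            # spectral norm (largest singular value)
col_sum = np.linalg.norm(B, 1)         # ||B||_1: maximum absolute column sum
row_sum = np.linalg.norm(B, np.inf)    # ||B||_inf: maximum absolute row sum
holds = spec ** 2 <= col_sum * row_sum + 1e-12
```

The inequality is what allows the two absolute-sum bounds (A.27) and (A.28), which are easy to read off from the entries of a banded triangular matrix, to control the spectral norm in (A.26).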


Proof of Theorem 3. We will first show that for each 1 < k < q n 


E 


1 

w 0 


n —1 

Y &t(k)e t+ i : k 

t=q„ 



(A.29) 


where $N_0 = n - q_n$, $\hat{e}_t(k) = (\hat{e}_t, \ldots, \hat{e}_{t-k+1})'$ and $\hat\varepsilon_{t+1,k} = \hat{e}_{t+1} + a'(k)\hat{e}_t(k)$ with $a'(k) = (a_1(k), \ldots, a_k(k))$.
Define $e_t(k) = (e_t, \ldots, e_{t-k+1})'$, $\varepsilon_{t+1,k} = e_{t+1} + a'(k)e_t(k)$ and $z_n = (z_1, \ldots, z_n)' = (I - H_{d_n})w(d_n)$,
where $w(d_n) = (w_1(d_n), \ldots, w_n(d_n))' = (\sum_{j=d_n+1}^{\infty} \theta_j x_{1j}, \ldots, \sum_{j=d_n+1}^{\infty} \theta_j x_{nj})'$. Let $\{o_i = (o_{1i}, \ldots, o_{ni})', 1 \le i \le d_n\}$ be an orthonormal basis of the column space of $X(d_n)$. Then, it holds that $H_{d_n} = \sum_{i=1}^{d_n} o_i o_i'$.
Moreover, one has for $1 \le k \le q_n$,


j n— 1 

- ]T et(k)e t+ly 


N ° t= qn 
n— 1 ( 


1 n—i i a n j f a n 

— Y <e t (k) + z t (k)-Y v i°t\ k ) \ \ e t+i,k + z t+i,k - Y Vi °t% 

0 t=q n y i= 1 J t i= 1 

1-1 


iU t+l,k 


i =1 


n—1 ^ n —1 d n \ \ n— 1 

^e t (/c)e m>fc + — ^z t (/e)e t+ljfe - ^ o[ t} (L)e t+ i 

t—Qn t = Qn i = l V t—Qn 


Nq *- 

U t=q n u t=q n 

n—1 


i=l 

dri 


t=q n 

n—1 


1 n—1 ^ n—1 d n ( \ n— 1 

°t=(7n i=l { °t=q n 


1-1 


1=1 L U t=9n 

dn dn 


■ w l -E* 


!=1 

dn. f , n—1 

'* I jVn E Z *^^i+l,fc 

i=l l U t=9n 


t—Qn 
.(*) 


^ n—1 

W ^ F Eoi' } (< 

i=lj=l l °t=q n 


+ EE 

Z=1 J = 1 

:= (I) H-h (IX) 


v, -IH ■ - (k)z t (k),Vi = o'e,of ) (A:) = ,o t 

k = ot+i t i + af(k)o[ l \k) . By Lemmas 3 and 4 of Ing and Wei (2003), 


where z t (k) = (z t ,..., z t _ k+ i)', z t+ljk = z t+ 1 + a 

/ / 7 \ Til / 1 \ -r-» ^ 




(A.30) 
and 


Eiki)ir<c(t) 


r/2 


(A.31) 


Theorem 2.2 of Baxter (1962) and (3.6) ensure that the spectral density of et+ i,fc is bounded above, 
and hence by (3.5), Lemma 2 of Wei (1987) and Minkowski’s Inequality, 



for all 0 < l < k — 1. As a result, 


E||(n)ir<c 



(A.32) 


By (A.13), (3.5), the boundedness of the spectral density of et+ i,fc, and Lemma 2 of Wei (1987), 
one has for all 1 < i < d n and all 0 < l < k — 1, 


nvtfr <CVCY° 2 tJ <C, 

t =1 


(A.33) 
and


E 


71—1 


— y 

N r ^ 




t—Qn 


2 r 


<Cn- 2r Ei^o 2 ti ) < Cn 


-2 r 


(A.34) 


U=1 


Making use of (A.33), (A.34), the convexity of x r ,x > 0, and the Cauchy-Schwarz inequality, we 
obtain 


E||(III)ir < E ]T| 


s, 2—1 


^ n —1 

-W-J2°t\ k ) e t+lr 


t=q„ 


dn 


< 


«'E E M' 


i =1 
(7 > ? 


n— 1 




< E 


7=1 


0 i=(Jn 
77—1 


A 




2rN 


1/2 


r] r b r / 2 

< c^—. 


n' 


Following an argument similar to that used to prove (A.32), we have for 0 < l < k — 1, 


E 


77—1 


Nr ^2 z t+^et- 


t — Qn 


/ 77 — 1 N 

< Cn- r Ei^ z t+hk 

\t=qn / 


r/2 


< Cn r E ^ (1 + ^2 \aj(k)\) 2 ( Y^ w t( d n 

3 =1 


a=i 


r/2 


(A.35) 


and hence 


< Cn~^ ( |9j, , , 

\j>dn 


E|i(iv)ir<ci ^ r/2 


n 


Ei*ii 

\j>dn 


Similarly, for all 0 < l < k — 1, 
1 


E 


An 


^ ^ Zt—lZt+l,k 


t=q n 


2 r 


< CE ( l^E^n) ) <C(^|%| 

0 1 ' \j>dn 


t= 1 


yielding 


E||(v)ir <c 


r/2 


2r 


n 1/4 X>l 

j>d n 


(A.36) 


(A.37) 
By an argument analogous to (A.35), it holds that


E||(vn)ir<c 


dLk r / 2 


n' 


According to (A.33), 


E || (VI) 11 r < Cd^d- 1 ^ E|u,rE 

i =1 { 


U 


^ n— 1 

t\k)z w 

U t=q n 


< Cd r n d-^E 


i= 1 


1 71—1 

— ^2o ( t l \k)z t+hl 


N, 


t=q n 


Moreover, Minkowski’s inequality and the Cauchy-Schwarz inequality yield that for all 1 
and 0 < l < k — 1 , 


E 


1 


71—1 


N °t-l,i z t+l,h 


t=q n 


< 


< 


1 


n— 1 


E llvE»h 


t=q n 


r / 2 / n—1 \ 

1 U 5 z ‘ 2+i l 


r/2 ' 


Cn _r//2 E ^V 

v A ^ t = 


71—1 


r/2 


4+1,i 


< Cn - r / 2 | Y, W 


As a result, 


Eii( vi )ir<c^[,t,5> 


j 


Similarly, it can be shown that 


E||(VHI)|r<C^(d„X> j | 

j>dn 


and 


E||(ix)ir <c 


d 2 Jk r / 2 


n' 


Consequently, (A.29) follows from (A.30)-(A.32) and (A.35)-(A.41). 
Let 


G n = max 

1 <k<q n 


1 


71—1 




t=q n 


Then, for any M > 0, one obtains from (A.29) and Chebyshev’s inequality that 


P Gn>M 


1/2+1 A' 

Hn 


n 


1/2 


= P G r n > M r - q n 


q n \ r / 2 


n 


< 


C 


Qn 


r /2 + 1 yr r . L - , 

q n ivi k=1 


yk-/ 2 < A. 

- M r 


(A.38) 


< i < d n 


(A.39) 


(A.40) 


(A-41) 


Hence 


max 

1 <k<q n 


j n —1 

— ^e t (fc)e t+1 , A 


t=q n 


Or, 


' 1/2+1/r' 
Hn 


n 


1/2 


In the following, we shall show that 


1 

ivo 


n—1 

t=q n 



°P (1)* 


(A.42) 


(A.43) 


Note first that 


lb _L ^ lb X 

w)]ei(9n)e[(«„) - — Vei(g n )e^(g n ) 


c 


f 1 n_1 

1 NtE l|et(?n) -et(9„ 


C —Qn 
n —1 


/ n-l \ 1/2 / I "- 1 

+ hv^E 1^(9") -ei(9n)|| 2 UrE H e ^ 


1/2 ' 


N 0 


t—Qn 


N 0 


t—Qn 


Straightforward calculations imply 

1 n_1 n 

W^M - et(q n )\\ 2 < C^-\\e n - e n \\ 2 < {e n H dn e n + ||w n (d n )|| 2 } , 

t = Qn 

E ( e 'n H d n e n ) < Cdn , E(||w n (d n )|| 2 ) < nC{Y, j>dn l%l) 2 , and E(A' 0 " 1 ll e *(^)|| 2 ) < Cg n . As a 

result, 


^ n —1 ^ n—1 

^^2e t (q n )e[(q n ) ~ —^2e t (q n )e' t (q : 


iVo 


= O n 


t=q n 


qn dr. 


Tl 


t—qn 
2 


+ 9n[El^'l) ' “fV ' (/ "E 'V 1 )- 


1/2 


■ j i re 1 / 2 

\.3>dn / ]>dn 


Moreover, Lemma 2 of Ing and Wei (2003) yields 


^ n —1 

rw e t{qn) e t(qn) ^q n 


No 


t—Qn 


- °p ( n t/ 2 ) -^W- 


Combining (A.44) and (A.45) leads to the desired conclusion (A.43). 
By making use of (A.42) and (A.43), we next show that 


T n (q n ) - T n {q n ) 


= O 


(ql +Vr ' 

P l 77,1/2 


(A.44) 


(A.45) 


(A.46) 
It follows from (A.43) and (A.13) that


n —1 


lim P{Q n ) = lim P((N Q 1 Y] e t (q n )e' t (q n )) 1 exists) = 1, 

n—¥ oo n—>oo z —' 

t=q n 


(A.47) 


and 


^ n— 1 


-1 


£—Qn 


iQn = o P ( 1 ). 


In addition, an argument given in Proposition 3.1 of Ing, Chiou and Guo (2013) implies 

T n (q n ) ~ T n{q n ) < CqU 2 max ||a(fc) - a(fc)||, 

1 <k<q n 


(A.48) 


(A.49) 


where a (k) = (oi(A:),... , ak{k))'. Since on Q r 


max ||a(/c) — a(A:)|| < 

1 <k<q n 


No 


n —1 N 

t=q n / 


-1 


G n , 


(A.50) 


(A.46) is ensured by (A.42) and (A.47)-(A.50). 
The proof of (3.11) is also reliant on 


( l/r l+(2/r)' 

\i>-\q n )-n- 1 (q n )\\=O p {^ + qn 


which is in turn implied by (A.23) and 

||D n ((/ra) D n (Q , n)|| — Op 

To prove (A.52), note first that on the set Q n , 

max \&l — all 
1 <k<q n 


n 


' 1 A 1 +(2/r)’ 

Hn ^ 'in 


(A.51) 


n 


1/2 


n 


(A.52) 


< max 

l<k<q n 


n— 1 


1 ~2 2 

m 2s et +i,k a k 

V t=q n 


+ 


^ n —1 N 


i — Qn 


Gi 


(A.53) 


Moreover, by (3.8), (3.9), Lemma 6 of Ing and Wei (2005) and an argument similar to that used to 
prove (A.29), it holds that for all 1 < k < q n , 


E 


1 n —1 

— e t 2 +1 fc - a, 2 < Cn~ r/2 , 

N °t= qn 


and hence 


max 

l<k<q n 


71—1 


2 2 
e t+l,fc — 


£—Qn 


O p 


l/r 

qn 

n 1 / 2 



(A.54) 
Similarly, we have

$$|\hat\gamma_0 - \gamma_0| = O_p(n^{-1/2}). \tag{A.55}$$

Combining (3.10), (A.42), (A.47), (A.48), (A.53)-(A.55) and

$$\|\hat{D}_n(q_n) - D_n(q_n)\| = \max\Big\{ |\hat\gamma_0 - \gamma_0|, \max_{1 \le k \le q_n} |\hat\sigma_k^2 - \sigma_k^2| \Big\}$$

yields the desired conclusion (A.52). The proof is completed by noticing that (3.11) is an immediate
consequence of (A.20), (A.46) and (A.51).